REPORT  DOCUMENTATION  PAGE 

Form  Approved 

OMB No. 0704-0188


1  AGENCY  USE  ONLY  (Leave  Blank) 

2.  REPORT  DATE 

August  1993 

3.  REPORT  TYPE  AND  DATES  COVERED 
memorandum 

4. TITLE AND SUBTITLE

On the Convergence of Stochastic Iterative Dynamic Programming
Algorithms

5. FUNDING NUMBERS

NSF ASC-9217041
N00014-90-J-1942
NSF ECS-9216531
IRI-9013991

6. AUTHOR(S)

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 
545  Technology  Square 
Cambridge,  Massachusetts  02139 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


AIM  1441 
CBCL  84 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Office of Naval Research
Information Systems
Arlington, Virginia 22217


10.  SPONSORING/MONITORING 
AGENCY  REPORT  NUMBER 


11. SUPPLEMENTARY NOTES

None

12a. DISTRIBUTION/AVAILABILITY STATEMENT

DISTRIBUTION UNLIMITED. Approved for public release.

12b. DISTRIBUTION CODE

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for
the prediction and control of Markovian environments. These algorithms, including the TD(lambda)
algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated
heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous
proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques
of stochastic approximation theory via a new convergence theorem. The theorem establishes a general
class of convergent algorithms to which both TD(lambda) and Q-learning belong.


14. SUBJECT TERMS

reinforcement learning, stochastic approximation, convergence, dynamic programming


15.  NUMBER  OF  PAGES 

15 


16.  PRICE  CODE 


17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED

18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED

19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED

20. LIMITATION OF ABSTRACT: UNCLASSIFIED


AD-A276 517

94-07595

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
ARTIFICIAL  INTELLIGENCE  LABORATORY 

and 

CENTER  FOR  BIOLOGICAL  AND  COMPUTATIONAL  LEARNING 
DEPARTMENT  OF  BRAIN  AND  COGNITIVE  SCIENCES 

A.I. Memo No. 1441
C.B.C.L. Memo No. 84

August 6, 1993

On  the  Convergence  of  Stochastic  Iterative 
Dynamic  Programming  Algorithms 

Tommi  Jaakkola,  Michael  I.  Jordan  and  Satinder  P.  Singh 

Abstract 

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for
the prediction and control of Markovian environments. These algorithms, including the TD($\lambda$) algorithm
of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as
approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence
of these DP-based learning algorithms by relating them to the powerful techniques of stochastic
approximation theory via a new convergence theorem. The theorem establishes a general class of convergent
algorithms to which both TD($\lambda$) and Q-learning belong.


Copyright  ©  Massachusetts  Institute  of  Technology,  1993 


This report describes research done at the Dept. of Brain and Cognitive Sciences, the Center for Biological and
Computational Learning, and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology.
Support for CBCL is provided in part by a grant from the NSF (ASC-9217041). Support for the laboratory’s
artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Dept. of
Defense. The authors were supported by a grant from the McDonnell-Pew Foundation, by a grant from ATR
Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by grant
IRI-9013991 from the National Science Foundation, by grant N00014-90-J-1942 from the Office of Naval Research,
and by NSF grant ECS-9216531 to support an Initiative in Intelligent Control at MIT. Michael I. Jordan is an
NSF Presidential Young Investigator.


An important component of many real-world learning problems is the temporal credit
assignment problem: the problem of assigning credit or blame to individual components of a
temporally-extended plan of action, based on the success or failure of the plan as a whole. To
solve such a problem, the learner must be equipped with the ability to assess the long-term
consequences of particular choices of action and must be willing to forego an immediate payoff
for the prospect of a longer term gain. Moreover, because most real-world problems involving
prediction of the future consequences of actions involve substantial uncertainty, the learner
must be prepared to make use of a probability calculus for assessing and comparing actions.

There has been increasing interest in the temporal credit assignment problem, due
principally to the development of learning algorithms based on the theory of dynamic programming
(DP) (Barto, Sutton, & Watkins, 1990; Werbos, 1992). Sutton’s (1988) TD($\lambda$) algorithm
addressed the problem of learning to predict in a Markov environment, utilizing a temporal
difference operator to update the predictions. Watkins’ (1989) Q-learning algorithm extended
Sutton’s work to control problems, and also clarified the ties to dynamic programming.

In the current paper, our concern is with the stochastic convergence of DP-based learning
algorithms. Although Watkins (1989) and Watkins and Dayan (1992) proved that Q-learning
converges with probability one, and Dayan (1992) observed that TD(0) is a special case of
Q-learning and therefore also converges with probability one, these proofs rely on a construction
that is particular to Q-learning and fail to reveal the ties of Q-learning to the broad theory of
stochastic approximation (e.g., Wasan, 1969). Our goal here is to provide a simpler proof of
convergence for Q-learning by making direct use of stochastic approximation theory. We also
show that our proof extends to TD($\lambda$) for arbitrary $\lambda$. Several other authors have recently
presented results that are similar to those presented here: Dayan and Sejnowski (1993) for
TD($\lambda$), Peng and Williams (1993) for TD($\lambda$), and Tsitsiklis (1993) for Q-learning. Our results
appear to be closest to those of Tsitsiklis (1993).

We  begin  with  a  general  overview  of  Markovian  decision  problems  and  DP.  We  introduce 
the  Q-learning  algorithm  as  a  stochastic  form  of  DP.  We  then  present  a  proof  of  convergence 
for  a  general  class  of  stochastic  processes  of  which  Q-learning  is  a  special  case.  We  then  discuss 
TD($\lambda$) and show that it is also a special case of our theorem.


Markovian  decision  problems 


A  useful  mathematical  model  of  temporal  credit  assignment  problems,  studied  in  stochastic 
control  theory  (Aoki,  1967)  and  operations  research  (Ross,  1970),  is  the  Markovian  decision 
problem.  Markovian  decision  problems  are  built  on  the  formalism  of  controlled  Markov  chains. 
Let $S = \{1, 2, \ldots, N\}$ be a discrete state space and let $U(i)$ be the discrete set of actions available
to the learner when the chain is in state $i$. The probability of making a transition from state $i$
to state $j$ is given by $p_{ij}(u)$, where $u \in U(i)$. The learner defines a policy $\mu$, which is a function
from states to actions. Associated with every policy $\mu$ is a Markov chain defined by the state
transition probabilities $p_{ij}(\mu(i))$.

There is an instantaneous cost $c_i(u)$ associated with each state $i$ and action $u$, where $c_i(u)$
is a random variable with expected value $\bar{c}_i(u)$. We also define a value function $V^\mu(i)$, which is
the expected sum of discounted future costs given that the system begins in state $i$ and follows
policy $\mu$:

$$V^\mu(i) = \lim_{N \to \infty} E\left\{ \sum_{t=0}^{N-1} \gamma^t c_{s_t}(\mu(s_t)) \,\Big|\, s_0 = i \right\} \tag{1}$$

where $s_t \in S$ is the state of the Markov chain at time $t$. Future costs are discounted by a factor
$\gamma^t$, where $\gamma \in (0, 1)$. We wish to find a policy that minimizes the value function:

$$V^*(i) = \min_\mu V^\mu(i). \tag{2}$$

Such a policy is referred to as an optimal policy and the corresponding value function is referred
to as the optimal value function. Note that the optimal value function is unique, but an optimal
policy need not be unique.

Markovian  decision  problems  can  be  solved  by  dynamic  programming  (Bertsekas,  1987). 
The  basis  of  the  DP  approach  is  an  equation  that  characterizes  the  optimal  value  function. 
This  equation,  known  as  Bellman’s  equation,  characterizes  the  optimal  value  of  the  state  in 
terms  of  the  optimal  values  of  possible  successor  states: 

$$V^*(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j) \Big\} \tag{3}$$


To motivate Bellman’s equation, suppose that the system is in state $i$ at time $t$ and consider
how $V^*(i)$ should be characterized in terms of possible transitions out of state $i$. Suppose that
action $u$ is selected and the system transitions to state $j$. The expression $c_i(u) + \gamma V^*(j)$ is
the cost of making a transition out of state $i$ plus the discounted cost of following an optimal
policy thereafter. The minimum of the expected value of this expression, over possible choices
of actions, seems a plausible measure of the optimal cost at $i$ and by Bellman’s equation is
indeed equal to $V^*(i)$.

There are a variety of computational techniques available for solving Bellman’s equation.
The technique that we focus on in the current paper is an iterative algorithm known as value
iteration. Value iteration solves for $V^*(i)$ by setting up a recurrence relation for which Bellman’s
equation is a fixed point. Denoting the estimate of $V^*(i)$ at the $k$th iteration as $V^{(k)}(i)$, we
have:

$$V^{(k+1)}(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j) \Big\} \tag{4}$$

This iteration can be shown to converge to $V^*(i)$ for arbitrary initial $V^{(0)}(i)$ (Bertsekas, 1987).
The proof is based on showing that the iteration from $V^{(k)}(i)$ to $V^{(k+1)}(i)$ is a contraction
mapping. That is, it can be shown that:


$$\max_i |V^{(k+1)}(i) - V^*(i)| \le \gamma \max_i |V^{(k)}(i) - V^*(i)|, \tag{5}$$

which implies that $V^{(k)}(i)$ converges to $V^*(i)$ and also places an upper bound on the convergence
rate.

Watkins (1989) utilized an alternative notation for expressing Bellman’s equation that is
particularly convenient for deriving learning algorithms. Define the function $Q^*(i,u)$ to be the
expression appearing inside the “min” operator of Bellman’s equation:

$$Q^*(i,u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j) \tag{6}$$

Using this notation Bellman’s equation can be written as follows:

$$V^*(i) = \min_{u \in U(i)} Q^*(i,u). \tag{7}$$

Moreover, value iteration can be expressed in terms of Q functions:

$$Q^{(k+1)}(i,u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j), \tag{8}$$

where $V^{(k)}(i)$ is defined in terms of $Q^{(k)}(i,u)$ as follows:

$$V^{(k)}(i) = \min_{u \in U(i)} Q^{(k)}(i,u). \tag{9}$$

The mathematical convenience obtained from using Q’s rather than V’s derives from the fact
that the minimization operator appears inside the expectation in Equation 8, whereas it appears
outside the expectation in Equation 4. This fact plays an important role in the convergence
proof presented in this paper.
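
As a minimal sketch of Equations 4, 8 and 9, the following Python code runs value iteration in both forms on a small two-state, two-action problem. The arrays `p` and `cbar`, the discount `gamma` and the function names are hypothetical choices made for illustration; none of them appear in the paper.

```python
# Hypothetical 2-state, 2-action problem (illustrative numbers, not from the paper).
# p[i][u][j]: probability of moving from state i to state j under action u.
# cbar[i][u]: expected instantaneous cost of action u in state i.
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
cbar = [[1.0, 2.0],
        [0.5, 3.0]]
gamma = 0.9
states = range(2)
actions = range(2)


def value_iteration_v(num_iters):
    """Equation 4: the min is taken outside the sum over successor states."""
    v = [0.0 for _ in states]
    for _ in range(num_iters):
        v = [min(cbar[i][u] + gamma * sum(p[i][u][j] * v[j] for j in states)
                 for u in actions)
             for i in states]
    return v


def value_iteration_q(num_iters):
    """Equations 8 and 9: the min is taken inside the sum over successor states."""
    q = [[0.0 for _ in actions] for _ in states]
    for _ in range(num_iters):
        v = [min(q[j]) for j in states]                          # Equation 9
        q = [[cbar[i][u] + gamma * sum(p[i][u][j] * v[j] for j in states)
              for u in actions]
             for i in states]                                    # Equation 8
    return q


v_star = value_iteration_v(500)
q_star = value_iteration_q(500)
```

Under these assumptions the two forms agree: taking the minimum over actions of `q_star[i]` recovers `v_star[i]`, as Equation 7 requires.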

The value iteration algorithm in Equation 4 or Equation 8 can also be executed
asynchronously (Bertsekas & Tsitsiklis, 1989). In an asynchronous implementation, the update of
the  value  of  a  particular  state  proceeds  in  parallel  with  the  updates  of  the  values  of  other  states. 
Bertsekas  &  Tsitsiklis  (1989)  show  that  as  long  as  each  state  is  updated  infinitely  often  and 
each  action  is  tried  an  infinite  number  of  times  in  each  state,  then  the  asynchronous  algorithm 
eventually  converges  to  the  optimal  value  function.  Moreover,  asynchronous  execution  has  the 
advantage  that  it  is  directly  applicable  to  real-time  Markovian  decision  problems  (RTDP;  Barto, 
Bradtke,  &  Singh,  1993).  In  a  real-time  setting,  the  system  uses  its  evolving  value  function 
to  choose  control  actions  for  an  actual  process  and  updates  the  values  of  the  states  along  the 
trajectory  followed  by  the  process. 
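
The difference between the two execution styles can be sketched as follows. The code below is only an illustration under stated assumptions: the three-state problem is hypothetical, and asynchrony is modelled in the simplest sequential way, by backing up one randomly chosen state at a time in place, rather than by a genuinely parallel implementation.

```python
import random

# Hypothetical 3-state, 2-action problem (illustrative numbers only).
p = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
     [[0.3, 0.3, 0.4], [0.0, 0.5, 0.5]],
     [[0.6, 0.0, 0.4], [0.2, 0.2, 0.6]]]
cbar = [[1.0, 0.5], [2.0, 1.5], [0.0, 1.0]]
gamma = 0.5
n_states, n_actions = 3, 2


def backup(v, i):
    """One Bellman backup (right-hand side of Equation 4) for state i."""
    return min(cbar[i][u] + gamma * sum(p[i][u][j] * v[j] for j in range(n_states))
               for u in range(n_actions))


def synchronous(num_sweeps):
    v = [0.0] * n_states
    for _ in range(num_sweeps):
        v = [backup(v, i) for i in range(n_states)]   # every state uses the old v
    return v


def asynchronous(num_updates, seed=0):
    rng = random.Random(seed)
    v = [0.0] * n_states
    for _ in range(num_updates):
        i = rng.randrange(n_states)   # one state at a time, in arbitrary order
        v[i] = backup(v, i)           # in place: later backups see the new value
    return v


v_sync = synchronous(200)
v_async = asynchronous(2000)
```

Because every state keeps being selected, both schedules approach the same fixed point in this example.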

Dynamic  programming  serves  as  a  starting  point  for  deriving  a  variety  of  learning  algorithms 
for  systems  that  interact  with  Markovian  environments  (Barto,  Bradtke,  &  Singh,  1993;  Sutton, 
1988;  Watkins,  1989).  Indeed,  real-time  dynamic  programming  is  arguably  a  form  of  learning 
algorithm  as  it  stands.  Although  RTDP  requires  that  the  system  possess  a  complete  model  of 
the environment (i.e., the probabilities $p_{ij}(u)$ and the expected costs $\bar{c}_i(u)$ are assumed known),
the  performance  of  a  system  using  RTDP  improves  over  time,  and  its  improvement  is  focused 
on  the  states  that  are  actually  visited.  The  system  “learns”  by  transforming  knowledge  in  one 
format  (the  model)  into  another  format  (the  value  function). 

A  more  difficult  learning  problem  arises  when  the  probabilistic  structure  of  the  environment 
is  unknown.  There  are  two  approaches  to  dealing  with  this  situation  (cf.  Barto,  Bradtke, 
&  Singh,  1993).  An  indirect  approach  acquires  a  model  of  the  environment  incrementally,  by 
estimating  the  costs  and  the  transition  probabilities,  and  then  uses  this  model  in  an  ongoing  DP 
computation.  A  direct  method  dispenses  with  constructing  a  model  and  attempts  to  estimate 
the  optimal  value  function  (or  the  optimal  Q-values)  directly.  In  the  remainder  of  this  paper, 
we  focus  on  direct  methods,  in  particular  the  Q-learning  algorithm  of  Watkins  (1989)  and  the 
TD($\lambda$) algorithm of Sutton (1988).

The  Q-learning  algorithm  is  a  stochastic  form  of  value  iteration.  Consider  Equation  8,  which 
expresses  the  update  of  the  Q  values  in  terms  of  the  Q  values  of  successor  states.  To  perform 
a  step  of  value  iteration  requires  knowing  the  expected  costs  and  the  transition  probabilities. 
Although  such  a  step  cannot  be  performed  without  a  model,  it  is  nonetheless  possible  to  estimate 
the appropriate update. For an arbitrary $V$ function, the quantity $\sum_{j \in S} p_{ij}(u) V(j)$ can be
estimated by the quantity $V(j)$, if successor state $j$ is chosen with probability $p_{ij}(u)$. But
this is assured by simply following the transitions of the actual Markovian environment, which
makes a transition from state $i$ to state $j$ with probability $p_{ij}(u)$. Thus the sample value of $V$ at
the successor state is an unbiased estimate of the sum. Moreover $c_i(u)$ is an unbiased estimate
of $\bar{c}_i(u)$. This reasoning leads to the following relaxation algorithm, where we use $Q_t(i,u)$ and
$V_t(i)$ to denote the learner’s estimates of the $Q$ function and $V$ function at time $t$, respectively:

$$Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) [c_{s_t}(u_t) + \gamma V_t(s_{t+1})] \tag{10}$$

where

$$V_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u). \tag{11}$$

The variables $\alpha_t(s_t, u_t)$ are zero except for the state that is being updated at time $t$.
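
A minimal tabular sketch of Equations 10 and 11 is given below. Several choices are assumptions made for illustration and are not prescribed by the text: actions are drawn uniformly at random, the learning rate is the reciprocal of the visit count of the updated state-action pair, and the two-state environment (`P`, `CBAR`, `noisy_cost`) is hypothetical. The transition probabilities are used only to simulate the environment; the learner itself never reads them.

```python
import random


def q_learning(p, sample_cost, gamma, num_steps, seed=0):
    """Tabular Q-learning (Equations 10 and 11) for cost minimization.

    p[i][u][j] is used only to simulate transitions of the environment.
    sample_cost(i, u, rng) returns a random cost c_i(u).
    """
    rng = random.Random(seed)
    n_states, n_actions = len(p), len(p[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(num_steps):
        u = rng.randrange(n_actions)                  # uniform exploration (assumption)
        s_next = rng.choices(range(n_states), weights=p[s][u])[0]
        c = sample_cost(s, u, rng)
        visits[s][u] += 1
        alpha = 1.0 / visits[s][u]                    # rates sum to infinity, squares do not
        v_next = min(q[s_next])                       # Equation 11
        q[s][u] = (1 - alpha) * q[s][u] + alpha * (c + gamma * v_next)   # Equation 10
        s = s_next
    return q


# Hypothetical 2-state, 2-action environment with noisy costs (illustrative only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
CBAR = [[1.0, 2.0], [0.5, 3.0]]


def noisy_cost(i, u, rng):
    return CBAR[i][u] + rng.uniform(-0.5, 0.5)        # unbiased, bounded variance


q_hat = q_learning(P, noisy_cost, gamma=0.5, num_steps=200_000)
```

Only the entry for the visited pair $(s_t, u_t)$ changes at each step, which corresponds to the remark that $\alpha_t$ is zero for all other entries.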

The  fact  that  Q-learning  is  a  stochastic  form  of  value  iteration  immediately  suggests  the  use 
of  stochastic  approximation  theory,  in  particular  the  classical  framework  of  Robbins  and  Monro 
(1951). Robbins-Monro theory treats the stochastic convergence of a sequence of unbiased
estimates of a regression function, providing conditions under which the sequence converges to
a root of the function. Although the stochastic convergence of Q-learning is not an immediate
consequence of Robbins-Monro theory, the theory does provide results that can be adapted
to studying the convergence of DP-based learning algorithms. In this paper we utilize a result
from Dvoretzky’s (1956) formulation of Robbins-Monro theory to prove the convergence of both
Q-learning and TD($\lambda$).


Convergence  proof  for  Q-learning 

Our proof is based on the observation that the Q-learning algorithm can be viewed as a
stochastic process to which techniques of stochastic approximation are generally applicable. Due to the
lack of a formulation of stochastic approximation for the maximum norm, however, we need to
slightly extend the standard results. This is accomplished by the following theorem, the proof
of which is given in Appendix A.

Theorem 1 A random iterative process $\Delta_{n+1}(x) = (1 - \alpha_n(x))\Delta_n(x) + \beta_n(x) F_n(x)$ converges
to zero w.p.1 under the following assumptions:

1) The state space is finite.

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and
$E\{\beta_n(x) \mid P_n\} \le E\{\alpha_n(x) \mid P_n\}$ uniformly w.p.1.

3) $\| E\{F_n(x) \mid P_n\} \|_W \le \gamma \| \Delta_n \|_W$, where $\gamma \in (0, 1)$.

4) $\mathrm{Var}\{F_n(x) \mid P_n\} \le C(1 + \| \Delta_n \|_W)^2$, where $C$ is some constant.

Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$,
$\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid.
The notation $\| \cdot \|_W$ refers to some weighted maximum norm.
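
To make the form of the process concrete, the sketch below simulates a scalar instance of it. All specifics are assumptions chosen for illustration: a single state $x$, $\alpha_n = \beta_n = 1/n$, and $F_n = \gamma \Delta_n + w_n$ with $w_n$ standard Gaussian noise, so that the conditional mean of $F_n$ is $\gamma \Delta_n$ and its conditional variance is constant. A simulation of this kind illustrates the statement and is not a substitute for the proof.

```python
import random


def simulate(gamma, num_steps, delta0=5.0, seed=0):
    """Iterate Delta_{n+1} = (1 - a_n) Delta_n + b_n F_n with a_n = b_n = 1/n."""
    rng = random.Random(seed)
    delta = delta0
    for n in range(1, num_steps + 1):
        a_n = 1.0 / n                                 # sum a_n = inf, sum a_n^2 < inf
        f_n = gamma * delta + rng.gauss(0.0, 1.0)     # E{F_n | past} = gamma * Delta_n
        delta = (1.0 - a_n) * delta + a_n * f_n
    return delta


final_delta = simulate(0.5, 100_000)
```

In this example the iterate ends close to zero despite the persistent noise, because the decreasing step sizes average the noise out while the contraction pulls the mean toward zero.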

In applying the theorem, the $\Delta_n$ process will generally represent the difference between a
stochastic process of interest and some optimal value (e.g., the optimal value function). The
formulation of the theorem therefore requires knowledge to be available about the optimal
solution to the learning problem before it can be applied to any algorithm whose convergence is
to be verified. In the case of Q-learning the required knowledge is available through the theory
of DP and Bellman’s equation in particular.

The convergence of the Q-learning algorithm now follows easily by relating the algorithm to
the converging stochastic process defined by Theorem 1.¹ In the form of the theorem we have:

Theorem 2 The Q-learning algorithm given by

$$Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t)) Q_t(s_t, u_t) + \alpha_t(s_t, u_t) [c_{s_t}(u_t) + \gamma V_t(s_{t+1})]$$

converges to the optimal $Q^*(s,u)$ values if

1) The state and action spaces are finite.

2) $\sum_t \alpha_t(s,u) = \infty$ and $\sum_t \alpha_t^2(s,u) < \infty$ uniformly w.p.1.

3) $\mathrm{Var}\{c_s(u)\}$ is bounded.

4) If $\gamma = 1$ all policies lead to a cost free terminal state w.p.1.

¹ We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its
generality, however, allows it to be applied to other algorithms as well (see the following section on TD($\lambda$)).


Proof. By subtracting $Q^*(s,u)$ from both sides of the learning rule and by defining
$\Delta_t(s,u) = Q_t(s,u) - Q^*(s,u)$ together with

$$F_t(s,u) = c_s(u) + \gamma V_t(s_{next}) - Q^*(s,u) \tag{12}$$

the Q-learning algorithm can be seen to have the form of the process in Theorem 1 with $\beta_t(s,u) = \alpha_t(s,u)$.

To verify that $F_t(s,u)$ has the required properties we begin by showing that it is a contraction
mapping with respect to some maximum norm. This is done by relating $F_t$ to the DP value
iteration operator for the same Markov chain. More specifically,

$$\begin{aligned}
\max_u |E\{F_t(i,u)\}| &= \gamma \max_u \Big| \sum_j p_{ij}(u) [V_t(j) - V^*(j)] \Big| \\
&\le \gamma \max_u \sum_j p_{ij}(u) \max_v |Q_t(j,v) - Q^*(j,v)| \\
&= \gamma \max_u \sum_j p_{ij}(u) V^\Delta(j) = T(V^\Delta)(i)
\end{aligned}$$

where $V^\Delta(j) = \max_v |Q_t(j,v) - Q^*(j,v)|$ and $T$ is the DP value iteration operator for the case
where the costs associated with each state are zero. If $\gamma < 1$ the contraction property of $T$ and thus
of $F_t$ can be seen directly from the above formulas. When the future costs are not discounted
($\gamma = 1$) but the chain is absorbing and all policies lead to the terminal state w.p.1 there still exists
a weighted maximum norm with respect to which $T$ is a contraction mapping (see e.g. Bertsekas &
Tsitsiklis, 1989).

The variance of $F_t(s,u)$ given the past is within the bounds of Theorem 1 as it depends on
$Q_t(s,u)$ at most linearly and the variance of $c_s(u)$ is bounded.

Note  that  the  proof  covers  both  the  on-line  and  batch  versions.  □ 


The TD($\lambda$) algorithm

The TD($\lambda$) algorithm (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in
a Markovian environment. Unlike Q-learning, however, TD does not involve decision-making
tasks but rather predictions about the future costs of an evolving system. TD($\lambda$) converges to
the same predictions as a version of Q-learning in which there is only one action available at
each state, but the algorithms are derived from slightly different grounds and their behavioral
differences are not well understood. In this section we introduce the algorithm and its derivation.
The proof of convergence is given in the following section.

Let us define $V_t(i)$ to be the current estimate of the expected cost incurred during the
evolution of the system starting from state $i$ and let $c_i$ denote the instantaneous random cost
at state $i$. As in the case of Q-learning we assume that the future costs are discounted at each
state by a factor $\gamma$. If no discounting takes place ($\gamma = 1$) we need to assume that the Markov
chain is absorbing, that is, there exists a cost free terminal state to which the system converges
with probability one.

We are concerned with estimating the future costs that the learner has to incur. One way
to achieve these predictions is to simply observe $n$ consecutive random costs weighted by the
discount factor and to add the best estimate of the costs thereafter. This gives us the estimate

$$V_t^{(n)}(i_t) = c_{i_t} + \gamma c_{i_{t+1}} + \gamma^2 c_{i_{t+2}} + \cdots + \gamma^{n-1} c_{i_{t+n-1}} + \gamma^n V_t(i_{t+n}) \tag{13}$$

The expected value of this can be shown to be a strictly better estimate than the current
estimate is (Watkins, 1989). In the undiscounted case this holds only when $n$ is larger than
some chain-dependent constant. To demonstrate this let us replace $V_t$ with $V^*$ in the above
formula giving $E\{V_t^{*(n)}(i_t)\} = V^*(i_t)$ which implies

$$\max_i |E\{V_t^{(n)}(i)\} - V^*(i)| \le \gamma^n \max_i \Pr\{m_i \ge n\} \max_i |V_t(i) - V^*(i)| \tag{14}$$

where $m_i$ is the number of steps in a sequence that begins in state $i$ (infinite in the
nonabsorbing case). This implies that if either $\gamma < 1$ or $n$ is large enough so that the chain
can terminate before $n$ steps starting from an arbitrary initial state then the estimate $V_t^{(n)}$ is
strictly better than $V_t$. In general, the larger $n$ the more unbiased the estimate is as the effect
of incorrect $V_t$ vanishes. However, larger $n$ increases the variance of the estimate as there are
more (independent) terms in the sum.

Despite  the  error  reduction  property  of  the  truncated  estimate  it  is  difficult  to  calculate  in 
practice  as  one  would  have  to  wait  n  steps  before  the  predictions  could  be  updated.  In  addition 
it  clearly  has  a  huge  variance.  A  remedy  to  these  problems  is  obtained  by  constructing  a  new 
estimate by averaging over the truncated predictions. TD($\lambda$) is based on taking the geometric
average:

$$V_t^\lambda(i) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_t^{(n)}(i) \tag{15}$$

As a weighted average it is still a strictly better estimate than $V_t(i)$ with the additional benefit
of being better in the undiscounted case as well (as the summation extends to infinity).
Furthermore, we have introduced a new parameter $\lambda$ which affects the trade-off between the bias
and variance of the estimate (Watkins, 1989). An increase in $\lambda$ puts more weight on less biased
estimates with higher variances and thus the bias in $V_t^\lambda$ decreases at the expense of a higher
variance.

The mathematical convenience of using the geometric average can be seen as follows. Given
the estimates $V_t^\lambda(i)$ the obvious way to use them in a learning rule is

$$V_{t+1}(i_t) = V_t(i_t) + \alpha_t [V_t^\lambda(i_t) - V_t(i_t)] \tag{16}$$

In terms of prediction differences, that is

$$\Delta_t(i_t) = c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t) \tag{17}$$

the geometric weighting allows us to write the correction term in the learning rule as

$$V_t^\lambda(i_t) - V_t(i_t) = \Delta_t(i_t) + (\lambda\gamma) \Delta_t(i_{t+1}) + (\lambda\gamma)^2 \Delta_t(i_{t+2}) + \cdots \tag{18}$$
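
The identity in Equation 18 can be checked numerically on a single finite sequence. The sketch below assumes an absorbing sequence whose terminal prediction is zero and whose costs after termination are zero; the function names and the example numbers are hypothetical.

```python
def n_step_estimate(costs, values, t, n, gamma):
    """Equation 13 for an absorbing sequence.

    costs[k] is the cost at step k and values[k] is the current prediction for
    the state visited at step k. values has one more entry than costs: the last
    entry belongs to the terminal state and is 0. Costs after termination are
    0, so n is clipped to the number of remaining steps.
    """
    steps = min(n, len(costs) - t)
    total = sum(gamma ** k * costs[t + k] for k in range(steps))
    return total + gamma ** steps * values[t + steps]


def lambda_estimate(costs, values, t, gamma, lam):
    """Equation 15, with the infinite geometric tail summed in closed form.

    Every n at or beyond the remaining length gives the same estimate, so the
    tail of the series collapses to the single weight lam ** (remaining - 1).
    """
    remaining = len(costs) - t
    head = sum((1 - lam) * lam ** (n - 1) * n_step_estimate(costs, values, t, n, gamma)
               for n in range(1, remaining))
    tail = lam ** (remaining - 1) * n_step_estimate(costs, values, t, remaining, gamma)
    return head + tail


def td_error_sum(costs, values, t, gamma, lam):
    """V(i_t) plus the right-hand side of Equation 18."""
    total = values[t]
    for k in range(len(costs) - t):
        delta = costs[t + k] + gamma * values[t + k + 1] - values[t + k]   # Equation 17
        total += (lam * gamma) ** k * delta
    return total


costs = [1.0, 0.5, 2.0, 1.5]
values = [3.0, 2.0, 4.0, 1.0, 0.0]    # last entry: terminal state
lhs = lambda_estimate(costs, values, 0, gamma=0.9, lam=0.7)
rhs = td_error_sum(costs, values, 0, gamma=0.9, lam=0.7)
```

Here `lhs` and `rhs` agree up to rounding, and the same holds for every starting step of the sequence. Setting `lam` to 0 reduces the estimate to the one-step prediction, while `lam` equal to 1 gives the full discounted sum of observed costs.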

Note that up to now the prediction differences that need to be calculated in the future depend on
the current $V_t(i)$. If the chain is nonabsorbing this computational implausibility can, however,
be overcome by updating the predictions at each step with the prediction differences calculated
by using the current predictions. This procedure gives the on-line version of TD($\lambda$):

$$V_{t+1}(i) = V_t(i) + \alpha_t \Delta_t(i_t) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \tag{19}$$

where $\chi_i(k)$ is the indicator variable of whether state $i$ was visited at the $k$th step (of a sequence).
Note  that  the  sum  contains  the  effect  of  the  modifications  or  activity  traces  initiated  at  past  time 
steps.  Moreover,  it  is  important  to  note  that  in  this  case  the  theoretically  desirable  properties 
of  the  estimates  derived  earlier  may  hold  only  asymptotically  (see  the  convergence  proof  in  the 
next  section). 

In the absorbing case the estimates $V_t(i)$ can also be updated off-line, that is, after a
complete sequence has been observed. The learning rule for this case is derived simply from
collecting the correction traces initiated at each step of the sequence. More concisely, the total
correction is the sum of individual correction traces illustrated in eq. (18). This results in the
batch learning rule

$$V_{n+1}(i) = V_n(i) + \alpha_n \sum_{t=1}^{m} \Delta_n(i_t) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \tag{20}$$

where the $(m+1)$th step is the termination state.

We note that the above derivation of the TD($\lambda$) algorithm corresponds to the specific choice
of a linear representation for the predictors $V_t(i)$ (see, e.g., Dayan, 1992). Learning rules for
other representations can be obtained using gradient descent but these are not considered here.
In practice TD($\lambda$) is usually applied to an absorbing chain thus allowing the use of either the
batch or the on-line version but the latter is usually preferred.
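
The on-line rule of Equation 19 is commonly implemented by maintaining the inner sum incrementally as an activity trace for each state. The sketch below does this for one observed sequence. The function name, the dictionary-based table, the constant learning rate `alpha` and the example sequence are assumptions made for illustration.

```python
def td_lambda_online(sequence, costs, values, alpha, gamma, lam):
    """Apply Equation 19 along one observed sequence.

    sequence: visited states i_0, ..., i_m, where i_m is the terminal state.
    costs:    costs[t] is the cost incurred at step t (len(sequence) - 1 entries).
    values:   dict mapping state -> prediction; the terminal prediction stays 0.
    """
    values = dict(values)                     # leave the caller's table untouched
    trace = {s: 0.0 for s in values}          # sum_k (gamma*lam)^(t-k) chi_i(k)
    for t, cost in enumerate(costs):
        i_t, i_next = sequence[t], sequence[t + 1]
        for s in trace:
            trace[s] *= gamma * lam           # decay every activity trace
        trace[i_t] += 1.0                     # chi_i(t) = 1 for the visited state
        delta = cost + gamma * values[i_next] - values[i_t]   # Equation 17
        for s in values:
            values[s] += alpha * delta * trace[s]
    return values


start = {"A": 0.0, "B": 0.0, "T": 0.0}        # "T" is the terminal state
updated = td_lambda_online(["A", "B", "A", "T"], [1.0, 2.0, 0.5], start,
                           alpha=0.5, gamma=1.0, lam=0.5)
```

The batch rule of Equation 20 would instead hold the predictions fixed during the sequence, accumulate the same corrections, and apply their sum once the terminal state is reached.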


Convergence of TD($\lambda$)

As we are interested in strong forms of convergence we need to modify the algorithm slightly.
The learning rate parameters $\alpha_n$ are replaced by $\alpha_n(i)$ which satisfy $\sum_n \alpha_n(i) = \infty$ and
$\sum_n \alpha_n^2(i) < \infty$ uniformly w.p.1. These parameters allow asynchronous updating and they
can, in general, be random variables. The convergence of the algorithm is guaranteed by the
following theorem which is an application of Theorem 1.

Theorem 3 For any finite absorbing Markov chain, for any distribution of starting states with
no inaccessible states, and for any distributions of the costs with finite variances the TD($\lambda$)
algorithm given by

1)

$$V_{n+1}(i) = V_n(i) + \alpha_n(i) \sum_{t=1}^{m} [c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t)] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)$$

2)

$$V_{t+1}(i) = V_t(i) + \alpha_t(i) [c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t)] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)$$

converges to the optimal predictions w.p.1 provided $\sum_n \alpha_n(i) = \infty$ and $\sum_n \alpha_n^2(i) < \infty$ uniformly
w.p.1 and $\gamma, \lambda \in [0, 1]$ with $\gamma\lambda < 1$.

Proof for (1): Using the ideas described in the previous section the learning rule can be
written as

$$V_{n+1}(i) = V_n(i) + \alpha_n(i) \left[ G_n(i) - \frac{m(i)}{E\{m(i)\}} V_n(i) \right]$$

$$G_n(i) = \frac{1}{E\{m(i)\}} \sum_{k=1}^{m(i)} V_n^\lambda(i;k)$$

where $V_n^\lambda(i;k)$ is an estimate calculated at the $k$th occurrence of state $i$ in a sequence and for
mathematical convenience we have made the transformation $\alpha_n(i) \to E\{m(i)\}\alpha_n(i)$, where $m(i)$
is the number of times state $i$ was visited during the sequence.

To apply Theorem 1 we subtract $V^*(i)$, the optimal predictions, from both sides of the
learning equation. By identifying $\alpha_n(i) := \alpha_n(i) m(i) / E\{m(i)\}$, $\beta_n(i) := \alpha_n(i)$, and $F_n(i) :=
G_n(i) - V^*(i) m(i) / E\{m(i)\}$ we need to show that these satisfy the conditions of Theorem 1.
For $\alpha_n(i)$ and $\beta_n(i)$ this is obvious. We begin here by showing that $F_n(i)$ indeed is a contraction
mapping. To this end,


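$$\max_i |E\{F_n(i) \mid V_n\}| = \max_i \left| \frac{1}{E\{m(i)\}} E\{ (V_n^\lambda(i; 1) - V^*(i)) + (V_n^\lambda(i; 2) - V^*(i)) + \cdots \mid V_n \} \right|$$

which can be bounded above by using the relation

$$\begin{aligned}
|E\{V_n^\lambda(i; k) - V^*(i) \mid V_n\}|
  &\le E\{ |E\{V_n^\lambda(i; k) - V^*(i) \mid m(i) \ge k, V_n\}| \, \theta(m(i) - k) \mid V_n \} \\
  &\le P\{m(i) \ge k\} \, |E\{V_n^\lambda(i) - V^*(i) \mid V_n\}| \\
  &\le \gamma P\{m(i) \ge k\} \max_i |V_n(i) - V^*(i)|
\end{aligned}$$
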


where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. Here we have also used the fact that $V_n^\lambda(i)$ is a
contraction mapping independent of possible discounting. As $\sum_k P\{m(i) \ge k\} = E\{m(i)\}$ we
finally get

$$\max_i |E\{F_n(i) \mid V_n\}| \le \gamma \max_i |V_n(i) - V^*(i)|$$
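
As a worked version of this last step, the per-occurrence bound can be summed over $k$ and divided by $E\{m(i)\}$. The lines below only combine the relation and the tail-sum identity already stated; the shorthand $\|V_n - V^*\|$ for $\max_i |V_n(i) - V^*(i)|$ is introduced here for compactness.

```latex
\begin{aligned}
|E\{F_n(i) \mid V_n\}|
  &\le \frac{1}{E\{m(i)\}} \sum_{k \ge 1}
       \bigl| E\{ (V_n^\lambda(i;k) - V^*(i)) \, \theta(m(i) - k) \mid V_n \} \bigr| \\
  &\le \frac{1}{E\{m(i)\}} \sum_{k \ge 1} \gamma \, P\{m(i) \ge k\} \, \|V_n - V^*\| \\
  &= \gamma \, \|V_n - V^*\|
\end{aligned}
```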

The variance of $F_n(i)$ can be seen to be bounded by

$$E\{m^4\} \max_i |V_n(i)|^2$$

For any absorbing Markov chain the convergence to the terminal state is geometric and thus
for every finite $k$, $E\{m^k\} \le C(k)$, implying that the variance of $F_n(i)$ is within the bounds of
Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD($\lambda$)
converges to the optimal predictions w.p.1. □

Proof for (2): The proof for the on-line version is achieved by showing that the effect
of the on-line updating vanishes in the limit, thereby forcing the two versions to be equal
asymptotically. We view the on-line version as a batch algorithm in which the updates are
made after each complete sequence but are made in such a manner so as to be equal to those
made on-line.

Define $G'_n(i) = G_n(i) + R_n(i)$ to be the new batch estimate, where $R_n(i)$ is the difference
between the on-line and batch estimates. We define the new batch learning parameters to
be the maxima over a sequence, that is $\alpha_n(i) = \max_{t \in S} \alpha_t(i)$. Now $R_n(i)$ consists of terms
proportional to

$$[c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t)]$$

the expected value of which can be bounded by $A = 2 \| V_n - V^* \|$. Assuming that $\gamma\lambda < 1$
(which implies that the multipliers of the above terms are bounded) we can get an upper bound
for the expected value of the correction $R_n(i)$. Let us define $R_{n,t}$ to be the expected difference
between the on-line estimate after $t$ steps and the first $t$ terms of the batch estimate. We can
bound $R_{n,t}(i)$ readily by the update rule, resulting in the iteration

$$\| R_{n,t+1} \| \le \| \alpha_n \| \, C \, (A + \| R_{n,t} \|)$$

where $R_{n,n}(i) = E\{R_n(i) \mid V_n\}$, $R_{n,0}(i) = 0$, and $C$ is some constant. Since $\| \alpha_n \|$ goes to zero
w.p.1 the above iteration implies that $\| R_{n,n} \| \to 0$ w.p.1, giving

$$\max_i |E\{R_n(i) \mid V_n\}| \le C_n \max_i |V_n(i) - V^*(i)|$$


where $C_n \to 0$ w.p.1. Therefore, using the results for the batch algorithm, $F'_n(i) = G'_n(i) -
V^*(i) m(i)/E\{m(i)\}$ satisfies

$$\max_i |E\{F'_n(i)\}| \le (\gamma + C_n) \max_i |V_n(i) - V^*(i)|$$

where for large $n$, $(\gamma + C_n) \le \gamma' < 1$ w.p.1. The variance of $R_n(i)$ and thereby that of $F'_n(i)$ are
within the bounds of Theorem 1 by linearity. This completes the proof. □
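
The way the correction bound shrinks with the learning rate can be checked numerically. The sketch below treats the recursion for the correction term as an equality (the worst case) and iterates it from zero; the constant `C = 2.0` and the number of steps are hypothetical values chosen only for illustration.

```python
def correction_bound(alpha_norm, A, C=2.0, steps=10):
    """Iterate r <- alpha_norm * C * (A + r) from r = 0 (worst case of the bound)."""
    r = 0.0
    for _ in range(steps):
        r = alpha_norm * C * (A + r)
    return r


# The ratio bound / A plays the role of C_n: it shrinks as the step size shrinks.
ratios = [correction_bound(a, A=1.0) for a in (0.1, 0.01, 0.001)]
```

When `alpha_norm * C < 1` the iterates stay below the fixed point `A * alpha_norm * C / (1 - alpha_norm * C)`, which is proportional to `A` with a coefficient that vanishes with the step size. This matches the form of the bound on the expected correction used in the proof.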


Conclusions 

In this paper we have extended results from stochastic approximation theory to cover asynchronous
relaxation processes which have a contraction property with respect to some maximum
norm (Theorem 1). This new class of converging iterative processes is shown to include both
the Q-learning and TD($\lambda$) algorithms in either their on-line or batch versions. We note that
the convergence of the on-line version of TD($\lambda$) has not been shown previously. We also wish
to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD($\lambda$)
utilize only high-level statistical properties of the estimates used in these algorithms and do not
rely on constructions specific to the algorithms. Our approach also sheds additional light on
the similarities between Q-learning and TD($\lambda$).

Although  Theorem  1  is  readily  applicable  to  DP-based  learning  schemes,  the  theory  of 
Dynamic  Programming  is  important  only  for  its  characterization  of  the  optimal  solution  and 
for  a  contraction  property  needed  in  applying  the  theorem.  The  theorem  can  be  applied  to 
iterative algorithms of different types as well.

Finally we note that Theorem 1 can be extended to cover processes that do not show the
usual contraction property, thereby increasing its applicability to algorithms of possibly more
practical importance.

Proof  of  Theorem  1 

In this section we provide a detailed proof of the theorem on which the convergence proofs for
Q-learning and TD($\lambda$) were based. We introduce and prove three essential lemmas, which will
also help to clarify ties to the literature and the ideas behind the theorem, followed by the proof
of Theorem 1. The notation $\| \cdot \|_W = \max_x | \cdot / W(x) |$ will be used in what follows.
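
For concreteness, the weighted maximum norm can be written as a one-line Python function. This is only an illustration of the notation; the function name is hypothetical and the weights are assumed to be strictly positive.

```python
def weighted_max_norm(values, weights):
    """Return max over x of |values[x] / weights[x]| (weights assumed positive)."""
    return max(abs(v / w) for v, w in zip(values, weights))
```

With all weights equal to one this reduces to the ordinary maximum norm.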

Lemma 1 A random process

$$w_{n+1}(x) = (1 - \alpha_n(x)) w_n(x) + \beta_n(x) r_n(x)$$

converges to zero with probability one if the following conditions are satisfied:

1) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, and $\sum_n \beta_n^2(x) < \infty$ uniformly w.p.1.

2) $E\{r_n(x) \mid P_n\} = 0$ and $E\{r_n^2(x) \mid P_n\} \le C$ w.p.1, where

$$P_n = \{w_n, w_{n-1}, \ldots, r_{n-1}, r_{n-2}, \ldots, \alpha_{n-1}, \alpha_{n-2}, \ldots, \beta_{n-1}, \beta_{n-2}, \ldots\}$$

All the random variables are allowed to depend on the past $P_n$.

Proof. Except for the appearance of $\beta_n(x)$ this is a standard result. With the above
definitions convergence follows directly from Dvoretzky's extended theorem (Dvoretzky, 1956).
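
A scalar simulation gives a feel for Lemma 1. The sketch below is an illustration under stated assumptions, not part of the proof: the step sizes `1/n` and `0.5/n` and the standard normal noise are choices made here so that the conditions of the lemma hold.

```python
import random


def lemma1_process(steps, seed=0):
    """Simulate w <- (1 - alpha_n) w + beta_n r_n with zero-mean, bounded-variance noise."""
    rng = random.Random(seed)
    w = 1.0
    for n in range(1, steps + 1):
        alpha = 1.0 / n          # sum diverges, sum of squares converges
        beta = 0.5 / n           # same summability properties as alpha
        r = rng.gauss(0.0, 1.0)  # E{r} = 0, E{r^2} = 1
        w = (1.0 - alpha) * w + beta * r
    return w
```

With these particular step sizes the process is half the running average of the noise samples, so `lemma1_process(100000)` should return a value close to zero.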

Lemma 2 Consider a random process $X_{n+1}(x) = G_n(X_n, x)$, where

$$G_n(\beta X_n, x) = \beta G_n(X_n, x)$$

Let us suppose that if we kept $\| X_n \|$ bounded by scaling, then $X_n$ would converge to zero w.p.1.
This assumption is sufficient to guarantee that the original process converges to zero w.p.1.

Proof. Note that the scaling of $X_n$ at any point of the iteration corresponds to having
started the process with scaled $X_0$. Fix some constant $C$. If during the iteration $\| X_n \|$
increases above $C$, then $X_n$ is scaled so that $\| X_n \| = C$. By the assumption then this process
must converge w.p.1. To show that the net effect of the corrections must stay finite w.p.1 we
note that if $\| X_n \|$ converges then for any $\epsilon > 0$ there exists $M_\epsilon$ such that $\| X_n \| < \epsilon < C$ for
all $n > M_\epsilon$ with probability at least $1 - \epsilon$. But this implies that the iteration stays below $C$
after $M_\epsilon$ and converges to zero without any further corrections. □

Lemma 3 A stochastic process $X_{n+1}(x) = (1 - \alpha_n(x)) X_n(x) + \gamma \beta_n(x) \| X_n \|$ converges to zero
w.p.1 provided

1) $x \in S$, where $S$ is a finite set.

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and $E\{\beta_n(x)\} \le
E\{\alpha_n(x)\}$ uniformly w.p.1.

Proof. Essentially the proof is an application of Lemma 2. To this end, assume that we
keep $\| X_n \| \le C_1$ by scaling, which allows the iterative process to be bounded by

$$|X_{n+1}(x)| \le (1 - \alpha_n(x)) |X_n(x)| + \gamma \beta_n(x) C_1$$

This is linear in $|X_n(x)|$ and can be easily shown to converge w.p.1 to some $X^*(x)$, where
$\| X^* \| \le \gamma C_1$. Hence, for small enough $\epsilon$, there exists $M_1(\epsilon)$ such that $\| X_n \| \le C_1/(1 + \epsilon)$
for all $n > M_1(\epsilon)$ with probability at least $p_1(\epsilon)$. With probability $p_1(\epsilon)$ the procedure can be
repeated for $C_2 = C_1/(1 + \epsilon)$. Continuing in this manner and choosing $p_k(\epsilon)$ so that $\prod_k p_k(\epsilon)$
goes to one as $\epsilon \to 0$ we obtain the w.p.1 convergence of the bounded iteration and Lemma 2
can be applied. □
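
The behaviour claimed in Lemma 3 can be observed in a small deterministic sketch. The assumptions made here for illustration are: $\gamma < 1$ (as in Theorem 1), equal step sizes $\alpha_n = \beta_n = 1/(n+1)$ for every component, the unweighted maximum norm, and a hypothetical starting vector.

```python
def lemma3_process(x0, gamma, steps):
    """Iterate X(x) <- (1 - alpha_n) X(x) + gamma * beta_n * ||X|| and return the final norm."""
    x = list(x0)
    for n in range(1, steps + 1):
        alpha = beta = 1.0 / (n + 1)            # beta <= alpha, as the lemma requires
        norm = max(abs(v) for v in x)           # maximum norm over the finite set
        x = [(1.0 - alpha) * v + gamma * beta * norm for v in x]
    return max(abs(v) for v in x)
```

Under these assumptions each step multiplies the norm by at most `1 - alpha * (1 - gamma)`, so the norm decays polynomially in the number of steps; `lemma3_process([4.0, -2.0, 1.0], 0.5, 20000)` returns a value far below the initial norm of 4.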

Theorem 1 A random iterative process $\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)$ converges
to zero w.p.1 under the following assumptions:

1) The state space is finite.

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and $E\{\beta_n(x) \mid P_n\} \le
E\{\alpha_n(x) \mid P_n\}$ uniformly w.p.1.

3) $\| E\{F_n(x) \mid P_n\} \|_W \le \gamma \| \Delta_n \|_W$, where $\gamma \in (0,1)$.

4) $\mathrm{Var}\{F_n(x) \mid P_n\} \le C (1 + \| \Delta_n \|_W)^2$, where $C$ is some constant.

Here $P_n$ stands for the past at step $n$. $F_n(x)$,
$\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid.
The notation $\| \cdot \|_W$ refers to some weighted maximum norm.
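
The assumptions of Theorem 1 can be exercised on a toy asynchronous process. The sketch below is hypothetical throughout: the four-component state, the cyclic choice of $F_n$, the noise scale `sigma`, and the visit-count step sizes are all chosen here so that assumptions 1) to 4) hold with unit weights, and the run is only an empirical illustration, not a substitute for the proof.

```python
import random


def theorem1_process(steps, gamma=0.3, sigma=0.5, seed=0):
    """Simulate Delta(x) <- (1 - alpha) Delta(x) + beta * F(x), one component per step."""
    rng = random.Random(seed)
    delta = [5.0, -3.0, 2.0, -4.0]               # finite state space (assumption 1)
    visits = [0] * len(delta)
    for _ in range(steps):
        x = rng.randrange(len(delta))            # asynchronous: update one component
        visits[x] += 1
        alpha = beta = 1.0 / (visits[x] + 1)     # assumption 2, with beta = alpha
        norm = max(abs(d) for d in delta)
        mean_f = gamma * delta[(x + 1) % len(delta)]   # |E{F}| <= gamma * ||Delta|| (assumption 3)
        noise = sigma * (1.0 + norm) * rng.gauss(0.0, 1.0)  # Var <= sigma^2 (1 + ||Delta||)^2 (assumption 4)
        delta[x] = (1.0 - alpha) * delta[x] + beta * (mean_f + noise)
    return max(abs(d) for d in delta)
```

Running `theorem1_process(200000)` returns a norm that is small compared with the initial value of 5, which is consistent with convergence to zero even though the noise variance grows with the size of the iterate.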

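Proof. By defining $r_n(x) = F_n(x) - E\{F_n(x) \mid P_n\}$ we can decompose the iterative process
into two parallel processes given by

$$\begin{aligned}
\delta_{n+1}(x) &= (1 - \alpha_n(x)) \delta_n(x) + \beta_n(x) E\{F_n(x) \mid P_n\} \\
w_{n+1}(x) &= (1 - \alpha_n(x)) w_n(x) + \beta_n(x) r_n(x)
\end{aligned} \qquad (21)$$

where $\Delta_n(x) = \delta_n(x) + w_n(x)$. Dividing the equations by $W(x)$ for each $x$ and denoting
$\delta'_n(x) = \delta_n(x)/W(x)$, $w'_n(x) = w_n(x)/W(x)$, and $r'_n(x) = r_n(x)/W(x)$ we can bound the $\delta'_n$
process by assumption 3) and rewrite the equation pair as

$$\begin{aligned}
|\delta'_{n+1}(x)| &\le (1 - \alpha_n(x)) |\delta'_n(x)| + \gamma \beta_n(x) \| |\delta'_n| + w'_n \| \\
w'_{n+1}(x) &= (1 - \alpha_n(x)) w'_n(x) + \gamma \beta_n(x) r'_n(x)
\end{aligned}$$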

Assume for a moment that the $\Delta_n$ process stays bounded. Then the variance of $r'_n(x)$ is
bounded by some constant $C$ and thereby $w'_n$ converges to zero w.p.1 according to Lemma 1.
Hence, there exists $M$ such that for all $n > M$, $\| w'_n \| < \epsilon$ with probability at least $1 - \epsilon$. This
implies that the $\delta'_n$ process can be further bounded by

$$|\delta'_{n+1}(x)| \le (1 - \alpha_n(x)) |\delta'_n(x)| + \gamma \beta_n(x) \| |\delta'_n| + \epsilon \|$$

with probability $> 1 - \epsilon$. If we choose $C$ such that $\gamma (C + 1)/C < 1$ then for $\| \delta'_n \| > C\epsilon$

$$\gamma \| |\delta'_n| + \epsilon \| \le \gamma \frac{C + 1}{C} \| \delta'_n \|$$

and the process defined by this upper bound converges to zero w.p.1 by Lemma 3. Thus $\| \delta'_n \|$
converges w.p.1 to some value bounded by $C\epsilon$, which guarantees the w.p.1 convergence of the
original process under the boundedness assumption.

By assumption 4) $r'_n(x)$ can be written as $(1 + \| \delta'_n + w'_n \|) s_n(x)$, where $E\{s_n^2(x) \mid P_n\} \le C$.
Let us now decompose $w'_n$ as $u_n + v_n$ with

$$u_{n+1}(x) = (1 - \alpha_n(x)) u_n(x) + \gamma \beta_n(x) \| \delta'_n + u_n + v_n \| s_n(x)$$

and $v_n$ converges to zero w.p.1 by Lemma 1. Again by choosing $C$ such that $\gamma (C + 1)/C < 1$
we can bound the $\delta'_n$ and $u_n$ processes for $\| \delta'_n + u_n \| > C\epsilon$. The pair $(\delta'_n, u_n)$ is then a
scale invariant process whose bounded version was proven earlier to converge to zero w.p.1 and
therefore by Lemma 2 it too converges to zero w.p.1. This proves the w.p.1 convergence of the
triple $\delta'_n$, $u_n$, and $v_n$ bounding the original process. □